Abstract

The presence of occluders significantly impacts object recognition accuracy. However, occlusion is typically treated as an unstructured
source of noise and explicit models for occluders have lagged behind those for object appearance and shape. In this paper we
describe a hierarchical deformable part model for face detection and landmark localization that explicitly models part occlusion. 
The proposed model structure makes it possible to augment positive training data with large numbers of synthetically occluded 
instances. This allows us to easily incorporate the statistics of occlusion patterns in a discriminatively trained model. We test the 
model on several benchmarks for landmark localization and detection including challenging new data sets featuring significant 
occlusion. We find that the addition of an explicit occlusion model yields a detection system that outperforms existing approaches 
for occluded instances while maintaining competitive accuracy in detection and landmark localization for unoccluded instances. 
1. Introduction

Accurate localization of facial landmarks provides an important building block for many applications
including identification [?] and analysis of facial expressions [?]. Significant progress has been made
in this task, aided in part by the fact that faces have less intra-category shape variation and limited
articulation compared to other object categories of interest. However, feature point localization tends
to break down when applied to faces in real scenes where other objects in the scene (hair, sunglasses,
other people) are likely to occlude parts of the face. Fig. 1(a) depicts the output of a deformable part
model [?] where the presence of occluders distorts the final alignment of the model.

A standard approach to handling occlusion in part-based models is to compete part feature scores against
a generic background model or fixed threshold (as in Fig. 1(b)). However, setting such thresholds is
fraught with difficulty since it is hard to distinguish between parts that are present but simply hard to
detect (e.g., due to unusual lighting) and those which are genuinely hidden behind another object.

Treating occlusions as an unstructured source of noise ignores a key aspect of the problem, namely that
occlusions are induced by other objects and surfaces in the scene and hence should exhibit occlusion
coherence. For example, it would seem very unlikely that every other landmark along an object contour
would happen to be occluded. Yet many occlusion models make strong independence assumptions about
occlusion, making it difficult to distinguish a priori likely from unlikely patterns. Ultimately, an
occluder should not be inferred simply by the lack of evidence for object features, but rather by
positive evidence for the occluding object that explains away the lack of object features.

Figure 1: Occlusion impacts part localization performance. In panel (a) the output of a deformable part
model [?] is distorted by the presence of occluders, disrupting localization even for parts that are far
from the site of occlusion. (b) Introducing independent occlusion of each part results in better
alignment, but occlusion is treated as an outlier process and prediction of occlusion state is
inaccurate. (c) The output of our hierarchical part model, which explicitly models likely patterns of
occlusion, shows improved localization as well as accurate prediction of which landmarks are occluded.

The contribution of this paper is an efficient hierarchical deformable part model that encodes these
principles for modeling occlusion and achieves state-of-the-art performance on benchmarks for occluded
face localization and detection (depicted in Fig. 1(c)). Building on our previously published results
[?], we model the face by an arrangement of parts, each of which is in turn composed of local landmark
features. This two-layer model provides a compact, discriminative representation for the appearance and
deformations of parts. It also captures the correlation in shapes and occlusion patterns of neighboring
parts (e.g., if the chin is occluded it would seem more likely the bottom half of the mouth is also
occluded). In addition to representing the face shape, each part has an associated occlusion state chosen
from a small set of possible occlusion patterns, enforcing coherence across neighboring landmarks and
providing a sparse representation of the occluder shape where it intersects the part. We describe the
details of this model in Section 3.

Specifying training data from which to learn feasible occlusion patterns comes with an additional set of
difficulties. Practically speaking, existing datasets have focused primarily on fully visible faces.
Moreover, it seems unlikely that any reasonably sized set of training images would serve to densely probe
the space of possible occlusions. Beyond certain weak contextual constraints, the location and identity
of the occluder itself are arbitrary and largely independent of the occluded object. To overcome this
difficulty of training data, we propose a unique approach for generating synthetically occluded positive
training examples. By exploiting the structural assumptions built into our model, we are able to include
such examples as “virtual training data” without explicitly synthesizing new images. This in turn leads
to an interesting formulation of discriminative training using a loss function that depends on the latent
occlusion state of the parts for negative training examples, which we describe in Section 4.

We carry out an extensive analysis of this model's performance in terms of landmark localization,
occlusion prediction and detection accuracy. While our model is trained as a detector, the internal
structure of the model allows it to perform high-quality landmark localization, comparable in accuracy to
pose regression, while being more robust to initialization and occlusions (Section 5.1). To carry out an
empirical comparison to recently published models, we provide a new set of 68-landmark annotations for
the Caltech Occluded Faces in the Wild (COFW) benchmark dataset. We find that not only the localization
but also the prediction of which landmarks are occluded is improved over simple independent occlusion
models (Section 5.2). Unlike landmark regression methods, our model does not require initialization and
achieves good performance on standard face detection benchmarks such as FDDB [?]. Finally, to illustrate
the impact of occlusion on existing detection models, we evaluate performance on a new face detection
dataset that contains significant numbers of partially occluded faces (Section 5.3).

2. Related Work 

Face Detection and Localization. There is a long history of face detection in the computer vision
literature. A classic approach treats detection as a problem of aligning a model to a test image using
techniques such as Deformable Templates [?], Active Appearance Models (AAMs) [7, 8, 9] and elastic graph
matching [10]. Alignment with full 3D models provides even richer information at the cost of additional
computation [?]. A key difficulty in many of these approaches is the dependence on iterative and local
search techniques for optimizing model alignment with a query image. This typically results in high
computational cost and the concern that local minima may undermine system performance.

Recently, approaches based on pose regression, which train regressors that predict landmark locations
from both appearance and spatial context provided by other detector responses, have also shown impressive
performance [?]. While these approaches lack an explicit model of face shape, stage-wise pose-regression
models can be trained efficiently in a discriminative fashion and thus sidestep the optimization problems
of global model alignment while providing fast, feed-forward performance at test time.

Pose regression is flexible in the choice of features and regressors used. The Supervised Descent Method
(SDM) [?] employs linear regression on SIFT features to compute shape increments. ESR [16] and RCPR [?]
predict shape increments using simple pixel-difference features and boosted ferns. LBF [?] learns a set
of binary features and a regression function using random forest regression. Zhu et al. proposed a
Coarse-to-Fine Shape Searching method (CFSS) [?] in which at each stage a cascade of linear regressors is
used to calculate a finer sub-space (represented as a center and scope). The incorporation of Deep
Convolutional Neural Network features has allowed further improvements by using raw image pixels as input
instead of hand-designed features, and allows end-to-end training. Zhang et al. proposed successive
auto-encoder networks (CFAN) to perform coarse-to-fine alignment [21]. TCDCN [?] trains a multi-task DCNN
jointly for landmark localization along with prediction of other facial attributes. They show that facial
attributes such as gender and expression can help in learning a robust landmark detector.

Our model is most closely related to the work of [24], which applies discriminatively trained deformable
part models (DPM) [23] to face analysis. This offers an intermediate point between the extremes of model
alignment and landmark regression by utilizing mixtures of simplified shape models that make efficient
global optimization of part placements feasible while exploiting discriminative training criteria.
Similar to [24], we use local part and landmark mixtures to encode richer multi-modal shape
distributions. We extend this line of work by adding hierarchical structure and explicit occlusion to the
model. We introduce intermediate part nodes that do not have an associated “root template” but instead
serve to encode an intermediate representation of occlusion and shape state. The notion of hierarchical
part models has been explored extensively as a tool for compositional representation and parameter
sharing (see e.g., [?]). While the intermediate state represented in such models can often be formally
encoded by non-hierarchical models with expanded state spaces and tied parameters, our experiments show
that the particular choice of model structure proves essential for efficient representation and
inference.
Figure 2: Our model consists of a tree of parts (black circles) each of which is connected to a set of landmarks (green or red) in a star topology. The examples here
show templates corresponding to different choices of part shape and occlusion patterns. Red indicates occluded landmarks. Shape parameters are independent of
occlusion state. Landmark appearance is modeled with a small HOG template (2nd row) and occluded landmarks are constrained to have an appearance template
fixed to 0. Note how the model produces a wide range of plausible shape configurations and occlusion patterns.


Occlusion Modeling. Modeling occlusion is a natural fit for recognition systems with an explicit
representation of parts. Work on generative constellation models [27, ?] learned parameters of a full
joint distribution over the probability of part occlusion and relied on brute force enumeration for
inference, a strategy that doesn’t scale to large numbers of landmarks. More commonly, part occlusions
are treated independently, which makes computation and representation more efficient. For example, the
supervised detection model of [?] associates with each part a binary variable indicating occlusion and
learns a corresponding appearance template for the occluded state.

The authors of [?] impose a more structured distribution on the possible occlusion patterns by specifying
a grammar that generates a person detector as a variable-length vertical chain of parts terminated by an
occluder template, while [30] allows “flexible compositions” which correspond to occlusion patterns that
leave visible a connected subgraph of the original tree-structured part model. Our approach provides a
stronger model than full independence, capturing correlations between occlusions of non-neighboring
landmarks. Unlike the grammar-based approach, occlusion patterns are not specified structurally but
instead learned from data and encoded in the model weights.

Pose regression approaches have also been adapted to incorporate explicit occlusion modeling. For
example, the face model of [?] uses a robust m-estimator which serves to truncate part responses that
fall below a certain threshold. In our experiments, we compare our results to the recent work of [?],
which uses occlusion annotations when training a cascade of regressors where each layer predicts both
part locations and occlusion states.

3. Hierarchical Part Model 

In this section we develop a hierarchical part model that simultaneously captures face appearance, shape
and occlusion. Fig. 2 shows a graphical depiction of the model structure. The model has two layers: the
face consists of a collection of parts (nose, eyes, lips) each of which is in turn composed of a number
of landmarks that specify local edge features making up the part. Landmarks are connected to their parent
part nodes with a star topology while the connections between parts form a tree. In addition to
location, each part takes one of a discrete set of shape states (corresponding to different facial shapes
or expressions) and occlusion states (corresponding to different patterns of visibility). The model
topology, which groups facial features into parts, was specified by hand while the shape and occlusion
patterns are learned automatically from training data (see Section 4.1). This model, which we term a
hierarchical part model (HPM), is a close cousin of the deformable part model (DPM) of [23] and the
flexible part model (FMP) of [24]. It differs in the addition of part nodes that model shape but don’t
include any “root filter” appearance term, and by the use of mixtures to model occlusion patterns for
each part. In this section we introduce some formal notation to describe the model and some important
algorithmic details for performing efficient message passing during inference.

3.1. Model Structure 

Let $l$, $s$, $o$ denote the hypothesized locations, shape and occlusion of $N_p$ parts and $N_l$
landmarks describing the face. Locations $l$ range over the whole image domain, and
$o \in \mathcal{O}_1 \times \mathcal{O}_2 \times \dots \times \mathcal{O}_N$ indicates the occlusion
states of parts and landmarks, where $N = N_p + N_l$. The shape
$s \in \mathcal{S}_1 \times \mathcal{S}_2 \times \dots \times \mathcal{S}_N$ selects one of a discrete
set of shape mixture components for each part. We define a tree-structured scoring function by:

$$Q(l, s, o \mid I) = \sum_i \phi_i(l_i, s_i, o_i \mid I) + \sum_i \sum_{j \in \mathrm{child}(i)} \left[ \psi_{ij}(l_i, l_j, s_i, s_j) + b_{ij}(s_i, s_j, o_i, o_j) \right] \qquad (1)$$

where the potential $\phi$ scores the consistency of the local image appearance around location $l_i$,
$\psi$ is a quadratic shape deformation penalty, and $b$ is a co-occurrence bias.
The first (unary) term scores the appearance evidence. We linearly parameterize the unary appearance term
with filter weights $w_i^{s_i}$ that depend on the discrete shape mixture selected:

$$\phi_i(l_i, s_i, o_i \mid I) = w_i^{s_i} \cdot \phi(l_i, o_i \mid I)$$

Appearance templates are only associated with the leaves (landmarks) in the model, so the unary term only
sums over those leaf nodes. The occlusion variables $o_i$ for the landmarks are binary, corresponding to
either occluded or visible. If the $i$th landmark is unoccluded, the appearance feature $\phi$ is given
by a HOG [?] feature extracted at location $l_i$; otherwise the feature is set to 0. This is natural on
theoretical grounds since the appearance of the occluder is arbitrary and hence indistinguishable from
background based on its local appearance. Empirically we have found that unconstrained occluder templates
learned with sufficiently varied data do in fact have very small norms, further justifying this choice
[?].

The second (pairwise) term in Eq. 1 scores the placement of part $j$ based on its location relative to
its parent $i$ and the shape mixtures of the child and parent. We model this with a linearly
parameterized function:

$$\psi_{ij}(l_i, l_j, s_i, s_j) = w_{ij}^{s_i, s_j} \cdot \psi(l_i - l_j)$$

where the feature $\psi$ includes the x and y displacements and their cross-terms, allowing the weights
$w_{ij}$ to encode a standard quadratic “spring”. We assume that the shape of the parts is independent of
any occluder, so the spring weights do not depend on the occlusion states.¹ The pairwise parameter
$b_{ij}$ encodes a bias of particular occlusion patterns and shapes to co-occur. Formally, each landmark
has the same number of occlusion states and shape mixtures as its parent part, but we fix the bias
parameters between the part and its constituent landmarks to impose a hard constraint that the mixture
assignments are compatible.
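
As a rough illustration of how the three kinds of terms in Eq. 1 combine, the following sketch scores a toy tree with one part and two landmarks. Everything here is a hypothetical stand-in: the function names, the 0/1 occlusion states, the `evidence` values that replace HOG filter responses, and the spring constant are assumptions chosen for readability, not the model's learned parameters.

```python
# Toy instance of a tree-structured score: unary terms on leaves,
# spring and bias terms on every parent-child edge. All names are illustrative.

def score(loc, shape, occ, parent, unary, pairwise, bias):
    """Sum unary terms over nodes and pairwise + bias terms over tree edges."""
    total = 0.0
    for i in range(len(loc)):
        total += unary(i, loc[i], shape[i], occ[i])
        p = parent[i]
        if p is not None:  # edge from parent p to child i
            total += pairwise(p, i, loc[p], loc[i], shape[p], shape[i])
            total += bias(p, i, shape[p], shape[i], occ[p], occ[i])
    return total

# One part (node 0) with two landmarks (nodes 1 and 2) in a star.
parent = [None, 0, 0]
is_leaf = [False, True, True]
evidence = {1: 2.0, 2: 1.5}  # stand-in for a filter response at each landmark

def unary(i, l, s, o):
    # Parts carry no appearance term; an occluded landmark (o == 1) has feature 0.
    return evidence[i] if is_leaf[i] and o == 0 else 0.0

def pairwise(p, c, lp, lc, sp, sc):
    dx, dy = lc[0] - lp[0], lc[1] - lp[1]
    return -0.1 * (dx * dx + dy * dy)  # quadratic spring with zero rest offset

def bias(p, c, sp, sc, op, oc):
    # Hard constraint: a landmark must share its parent's shape and occlusion state.
    return 0.0 if (sp, op) == (sc, oc) else float("-inf")

visible = score([(0, 0), (1, 0), (0, 1)], [0, 0, 0], [0, 0, 0],
                parent, unary, pairwise, bias)
```

In this toy setting an incompatible assignment (a landmark whose state differs from its parent's) receives a score of negative infinity, which mirrors the hard constraint imposed through the fixed bias parameters.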

3.2. Efficient Message Passing 

The model above can be made formally equivalent to the FMP model used in [24] by introducing local
mixture variables that live in the cross-product space of $o_i$ and $s_i$. However, this reduction fails
to exploit the structure of the occlusion model, which matters because the model is large. Naive
inference is quite slow due to the large number of landmarks and parts ($N = 68 + 10$) and the huge state
space for each node, which includes location, occlusion pattern and shape mixtures. Consider the message
passed from one part to another where each part has $L$ possible locations, $S$ shape mixtures and $O$
occlusion patterns. In general this requires maximizing over functions of size $(LSO)^2$, or $L(SO)^2$
when using the distance transform. In the models we test, $SO = 12$, which poses a substantial
computation and memory cost, particularly for high-resolution images where $L$ is large.
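
To give a feel for the two sizes quoted above, the short calculation below plugs in $S = 3$ and $O = 4$ (so $SO = 12$) together with an assumed $L = 10{,}000$ candidate locations, for example a 100 x 100 grid. The value of $L$ and the helper name `message_costs` are illustrative assumptions only.

```python
def message_costs(L, S, O):
    """Return the two function sizes for one part-to-part message."""
    brute_force = (L * S * O) ** 2           # every parent state against every child state
    with_distance_transform = L * (S * O) ** 2  # one pass over L per pair of discrete states
    return brute_force, with_distance_transform

brute_force, with_dt = message_costs(L=10_000, S=3, O=4)
ratio = brute_force // with_dt  # equals L: the distance transform removes one factor of L
```

Under these assumed numbers the distance transform removes a full factor of $L$, but the remaining $(SO)^2 = 144$ passes per message are still costly when repeated for every edge in the tree.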


¹ In practice we find it is sufficient for the deformation cost to depend only on the child shape
mixture, i.e. $\psi_{ij}(l_i, l_j, s_i, s_j) = w_{ij}^{s_j} \cdot \psi(l_i - l_j)$, which gives a factor
$S$ speedup with little decrease in performance.



Figure 3: Virtual positive examples are generated synthetically by starting 
with a fully visible training example and sampling random coherent occlusion 
patterns. 


Part-Part messages. While the factorization of shape and occlusion doesn’t change the asymptotic
complexity, we can reduce the runtime in practice by exploiting distributivity of the distance transform
over max to share computations. Standard message passing from part $j$ to part $i$ requires that we
compute:

$$\mu_{j \to i}(l_i, s_i, o_i) = \max_{l_j, s_j, o_j} \Big[ \psi_{ij}(l_i, l_j, s_i, s_j) + \sum_{k \in \mathrm{child}(j)} \mu_{k \to j}(l_j, s_j, o_j) + b_{ij}(s_i, s_j, o_i, o_j) \Big]$$

where we have dropped the unary term $\phi_j$, which is 0 for parts. Since the bias doesn’t depend on the
location of parts we can carry out the computation in two steps:

$$\nu_{ij}(l_i, s_i, s_j, o_j) = \max_{l_j} \Big[ \psi_{ij}(l_i, l_j, s_i, s_j) + \sum_{k \in \mathrm{child}(j)} \mu_{k \to j}(l_j, s_j, o_j) \Big]$$

$$\mu_{j \to i}(l_i, s_i, o_i) = \max_{s_j, o_j} \big[ \nu_{ij}(l_i, s_i, s_j, o_j) + b_{ij}(s_i, s_j, o_i, o_j) \big]$$

which only requires computing $S^2 O$ distance transforms.
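
The equivalence of the one-step and two-step computations can be checked numerically on a small example. The sketch below is an assumption-laden toy: locations are 1-D integers, the inner maximization is done by brute force rather than with a real distance transform, and the sizes, random tables and `spring` function are made up for illustration.

```python
import itertools
import random

random.seed(0)
L, S, O = 5, 2, 3  # toy sizes, not the values used in the paper
locs, shapes, occs = range(L), range(S), range(O)

# child_sum[lj][sj][oj]: sum of messages arriving at part j from its children.
child_sum = [[[random.random() for _ in occs] for _ in shapes] for _ in locs]
# bias[si][sj][oi][oj]: co-occurrence bias, independent of location.
bias = [[[[random.random() for _ in occs] for _ in occs] for _ in shapes] for _ in shapes]

def spring(li, lj, si, sj):
    rest = si - sj  # toy rest offset that depends on the shape pair
    return -0.5 * (li - lj - rest) ** 2

def direct(li, si, oi):
    """One-step message: maximize jointly over child location, shape and occlusion."""
    return max(spring(li, lj, si, sj) + child_sum[lj][sj][oj] + bias[si][sj][oi][oj]
               for lj, sj, oj in itertools.product(locs, shapes, occs))

# Step 1: maximize over child location once per (si, sj, oj); S*S*O tables of length L.
nu = {(si, sj, oj): [max(spring(li, lj, si, sj) + child_sum[lj][sj][oj] for lj in locs)
                     for li in locs]
      for si, sj, oj in itertools.product(shapes, shapes, occs)}

def two_step(li, si, oi):
    """Step 2: add the location-independent bias, then maximize over discrete states."""
    return max(nu[si, sj, oj][li] + bias[si][sj][oi][oj]
               for sj, oj in itertools.product(shapes, occs))
```

Each entry of `nu` plays the role of one distance transform, so the dictionary holds $S^2 O$ tables, compared with the $(SO)^2$ that would be needed if the parent occlusion state were folded into the location maximization.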

Landmark-Part messages. In our model the occlusion and shape variables for a landmark are determined
completely by the parent part state. Since the score is known for an occluded landmark in advance, it is
not necessary to compute distance transforms for those components. We write this computation as:

$$\nu_{jk}(l_j, s_j, o_j) = \begin{cases} 0 & \text{if } k \text{ occluded in } o_j \\ \max_{l_k} \big[ \psi_{jk}(l_j, l_k, s_j, s_j) + \phi_k(l_k, s_j, o_j \mid I) \big] & \text{otherwise} \end{cases}$$

$$\mu_{k \to j}(l_j, s_j, o_j) = \nu_{jk}(l_j, s_j, o_j) + b_{jk}(s_j, s_j, o_j, o_j)$$

where we have used the notation to explicitly capture the constraint that landmark shape and occlusion
mixtures $(s_k, o_k)$ must match those of the parent part $(s_j, o_j)$. In our models, this reduces the
memory and inference time by roughly a factor of 2, a savings that becomes increasingly significant as
the number of occlusion mixtures grows.
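
A minimal sketch of the shortcut for occluded landmarks is given below, assuming 1-D locations, a zero rest offset and a single spring weight. The function name `landmark_message` and its arguments are hypothetical; the point is only that the occluded branch never touches the response map.

```python
def landmark_message(lj, occluded, response, spring_w, bias_value):
    """Message from a landmark to its parent part placed at location lj (1-D toy)."""
    if occluded:
        best = 0.0  # score is known in advance, so no maximization over locations
    else:
        best = max(response[lk] - spring_w * (lk - lj) ** 2
                   for lk in range(len(response)))
    return best + bias_value

response = [0.0, 3.0, 1.0]  # stand-in for appearance scores at three locations
visible_msg = landmark_message(0, False, response, spring_w=1.0, bias_value=0.5)
occluded_msg = landmark_message(0, True, response, spring_w=1.0, bias_value=0.5)
```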

3.3. Global Mixtures for Viewpoint and Resolution 

Viewpoint and image resolution are the largest sources of variability in the appearance and relative
location of landmarks. To capture this, we use a mixture over head poses. These “global” mixtures can be
represented with the same notation as above by expanding the state-space of the shape variables to be the
cross product of the set of local shapes for part $i$ and the global viewpoint for the model (i.e.,
$s_i \in \mathcal{S}_i \times \mathcal{V}$) and fixing some entries of the bias $b_{ij}$ to be $-\infty$
to prevent mixing of local shapes from different viewpoints. In our implementation we tie parameters to
enforce the left- and right-facing models to be mirror symmetric.
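
The expansion of the shape state to a (viewpoint, local shape) pair can be sketched as follows. The table layout `local_bias[v][s_i][s_j]` and the function name `expand_bias` are assumptions made for this example; the only behavior taken from the text is that cross-viewpoint combinations receive a bias of negative infinity.

```python
NEG_INF = float("-inf")

def expand_bias(local_bias, n_views):
    """Build a bias table over joint (view, shape) states.

    local_bias[v][si][sj] holds the within-view bias; pairs of states that come
    from different views are assigned -inf so they can never be selected together.
    """
    n_shapes = len(local_bias[0])
    states = [(v, s) for v in range(n_views) for s in range(n_shapes)]
    return {(a, b): (local_bias[a[0]][a[1]][b[1]] if a[0] == b[0] else NEG_INF)
            for a in states for b in states}

# Two views, two local shapes per view (made-up numbers).
local_bias = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
expanded = expand_bias(local_bias, n_views=2)
```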

The HPM model we have described includes a large number of landmarks. While this is appropriate for high
resolution imagery, it does not perform well in detecting and modeling low resolution faces (< 150 pixels
tall). To address this we introduce an additional global mixture component for each viewpoint that
corresponds to a low-resolution HPM model consisting of a single half-resolution template for each part
and no landmark templates. This mixture is trained jointly with the full resolution model using the
strategy described in [34].

4. Model Training and Inference 

The potentials in our shape model are linearly parameterized, allowing efficient training using an SVM
solver [?]. Face viewpoint, landmark locations, shape and occlusion mixtures are completely specified by
pre-clustering the training data so that parameter learning is fully supervised. We first describe how
these supervised labels are derived from training data and how we synthesize “virtual” positive training
examples that include additional occlusion. We then discuss the details of the parameter learning and
test-time prediction.

4.1. Training Data 

We assume that a training data set of face images has been annotated with landmark locations for each
face. From such data we automatically generate additional mixture labels specifying viewpoint, shape, and
occlusion. We also generate additional virtual training examples by synthesizing plausible coherent
occlusion patterns.

Viewpoint and Resolution Mixtures. To cluster training examples into a set of discrete viewpoints, we
make use of the MultiPIE dataset, which provides ground-truth viewpoint annotations for a limited set of
faces. We perform Procrustes alignment between each training example and examples in the MultiPIE
database and then transfer the viewpoint label from the nearest MultiPIE example to the training example.
In our experiments we used either 3 or 7 viewpoint clusters (each viewpoint spans 15 degrees). In
addition to viewpoint, alignment to MultiPIE also provides a standard scale normalization and removes
in-plane rotations from the training set. To train the low-resolution mixture components, we use the same
training data but downsample the input image by a factor of 2.

Figure 4: Example shape clusters for face parts (nose, upper lip, lower lip). Co-occurrence biases for
combinations of part shapes are learned automatically from training data. Different colored points
correspond to the location of each landmark relative to the part (centroid).

Part Shape and Occlusion Mixtures. For each part and each viewpoint, we cluster the set of landmark
configurations in the training data in order to come up with a small number of shape mixtures for that
part. The part shapes in the final model are represented by displacements relative to a parent node, so
we subtract off the centroid of the part landmarks from each training example prior to clustering. The
vectors containing the coordinates of the centered landmarks are clustered using k-means. We imagine it
would be efficient to allocate more mixtures to parts and viewpoints that show greater variation in
shape, but in the final model tested here we use a fixed allocation of $S = 3$ shape mixtures per part
per viewpoint. Fig. 4 shows example clusterings of part shapes for the center view.
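
The centroid subtraction and clustering step can be sketched in a few lines of plain Python. This is a generic Lloyd-style k-means with a naive, deterministic initialization chosen for reproducibility; the function names, the initialization and the two-landmark "mouth" examples are assumptions, and the paper does not specify which k-means variant was used.

```python
def center(shape):
    """Subtract the centroid from a list of (x, y) landmarks and flatten to a vector."""
    cx = sum(x for x, _ in shape) / len(shape)
    cy = sum(y for _, y in shape) / len(shape)
    return [c for x, y in shape for c in (x - cx, y - cy)]

def kmeans(vectors, k, iters=20):
    """Plain Lloyd iterations; initial centers are evenly spaced input vectors."""
    centers = [list(vectors[i * len(vectors) // k]) for i in range(k)]
    labels = [0] * len(vectors)
    for _ in range(iters):
        labels = [min(range(k),
                      key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centers[c])))
                  for v in vectors]
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the previous center if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Two wide and two narrow two-landmark shapes at different image positions.
shapes = [[(0, 0), (4, 0)], [(10, 5), (14, 5)], [(0, 0), (1, 0)], [(7, 3), (8, 3)]]
labels, centers = kmeans([center(s) for s in shapes], k=2)
```

Because the centroid is removed first, the two wide shapes fall into one cluster and the two narrow shapes into the other regardless of where they sit in the image.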

Synthetic Occlusion Patterns. In the model each landmark is fully occluded or fully visible. The
occlusion state of a part describes the occlusion of its constituent landmarks. If there are $N_l$
landmarks then there are $2^{N_l}$ possible occlusion patterns. However, many of these occlusions are
quite unlikely (e.g., every other landmark occluded) since occlusion is typically generated by an
occluder object with a regular, compact shape.

To model spatial coherence among the landmark occlusions, we synthetically generate “valid” occlusion
patterns by first sampling mean part and landmark locations from the model and then randomly sampling a
quarter-plane shaped occluder and setting as occluded those landmarks that fall behind the occluder. Let
$(a, b)$ be uniformly sampled from a tight box surrounding the face. This selected origin point induces a
partition of the image into quadrants (i.e., $(x < a) \wedge (y < b)$, $(x > a) \wedge (y < b)$, etc.).
We choose a quadrant at random and mark all landmarks falling in that quadrant as occluded. While our
occluder is somewhat “boring”, it is straightforward to incorporate more interesting shapes, e.g., by
sampling from a database of segmented objects. Fig. 3 shows example occlusions generated for a training
example.
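
The quarter-plane sampling procedure described above might look like the following sketch. The function name, the use of the landmarks' own bounding box as the "tight box", and the treatment of landmarks lying exactly on the boundary are assumptions made for illustration.

```python
import random

def sample_quadrant_occlusion(landmarks, rng):
    """Return one occlusion flag per (x, y) landmark for a random quarter-plane occluder."""
    xs = [x for x, _ in landmarks]
    ys = [y for _, y in landmarks]
    a = rng.uniform(min(xs), max(xs))  # origin (a, b) drawn inside the bounding box
    b = rng.uniform(min(ys), max(ys))
    # Choose one of the four quadrants by picking a side in x and a side in y.
    right = rng.random() < 0.5
    below = rng.random() < 0.5
    return [((x > a) == right) and ((y > b) == below) for x, y in landmarks]

face = [(0, 0), (2, 0), (4, 0), (1, 3), (3, 3), (2, 5)]  # made-up landmark layout
flags = sample_quadrant_occlusion(face, random.Random(0))
```

By construction the occluded set is always the intersection of the landmarks with a single quadrant, so alternating patterns such as "every other landmark along a contour" cannot be produced.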

In our experiments we generate 8 synthetically occluded examples for each original training example. For
each part in the model we cluster the set of resulting binary vectors in order to generate a list of
valid part occlusion patterns. The occlusion state for each landmark in a training example is then set to
be consistent with the assigned part occlusion pattern. In our experiments we utilized only $O = 4$
occlusion mixtures per part, typically corresponding to unoccluded, fully occluded and two half-occluded
states whose structure depended on the part shape and location within the face.

4.2. Parameter learning 

Recall that our model (Eqn. 1) is parameterized by a set of weights and biases, which we collect into a
parameter vector $w$. Each weight is multiplied by some corresponding feature that depends on the
hypothesized model configuration $(l, s, o)$ and input image $I$. Collecting these features into a
feature vector $\Psi(l, s, o \mid I)$, we write the scoring function as an inner product with the model
weights, $Q(l, s, o) = w \cdot \Psi(l, s, o \mid I)$. We learn the model weights using a regularized SVM
objective:

$$\min_{w,\; \eta_t \ge 0} \; \frac{1}{2} \|w\|^2 + C \sum_t \eta_t$$

$$\text{s.t.} \quad w \cdot \Psi(l^t, s^t, o^t \mid I^t) \ge 1 - \eta_t \qquad \forall t \in \mathcal{P}$$

$$\phantom{\text{s.t.}} \quad w \cdot \Psi(l, s, o \mid I^t) \le -\big(1 - m\,\delta(o) - \eta_t\big) \qquad \forall l, s, o, \;\; \forall t \in \mathcal{N}$$

where $(l^t, s^t, o^t)$ denotes the supervised model configuration for a positive training example,
$\delta(o)$ is a margin scaling function that measures the fraction of occluded landmarks and $C$ and $m$
are hyper-parameters (described below). The constraint on positive images $t \in \mathcal{P}$ encourages
the score of the correct model configuration to be larger than 1 and penalizes violations using the slack
variable $\eta_t$. The second constraint encourages the score to be low on all negative training images
$t \in \mathcal{N}$ for all configurations of the latent variables.

Margin scaling for occlusion. This formulation differs from standard supervised DPM training in the
treatment of negative training examples. Since landmarks can be occluded in our model, fully or partially
occluded faces can be detected by our model in the negative training images. These images do not contain
any faces and we would like our model to generate low scores for these detections. However, a landmark
which is detected as occluded in a negative image is in some sense correct. There is no real distinction
between a negative image and a positive image of a fully occluded face! Thus we penalize negative
detections (false positives) with significant amounts of occlusion less than fully-visible false
positives.

For this purpose, we scale the margin for negative examples in proportion to the number of occluded
landmarks. We specify the margin for a negative example as $1 - m\,\delta(o)$, where the function
$\delta(o)$ measures the fraction of occluded landmarks and $m$ is a hyper-parameter. As the number of
occluded landmarks increases the margin decreases and the model score for that example can be larger
without violating the constraint. The margin for a fully occluded example is equal to $1 - m$. Setting
$m = 0$ corresponds to standard classification where all the negatives have the same margin of 1. In this
case the biases learned for occluded landmarks tend to be low (otherwise many fully or partially occluded
negative examples will violate the constraint). As a result, models trained with $m = 0$ tend not to
predict occlusion. As we increase $m$, the scores of fully or partially occluded negative examples can be
larger without violating the constraint and the training procedure is thus free to learn larger bias
parameters associated with occluded landmarks. As we show in our experimental evaluation, this results in
higher recall of occluded landmarks and improved test-time performance.
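
The effect of margin scaling on a single negative detection can be illustrated with a small helper. The function names and the value $m = 0.5$ are assumptions for this example; the text does not state the value of $m$ used in the experiments.

```python
def negative_margin(occluded_flags, m):
    """Margin 1 - m * delta(o), where delta(o) is the fraction of occluded landmarks."""
    delta = sum(occluded_flags) / len(occluded_flags)
    return 1.0 - m * delta

def negative_slack(score, occluded_flags, m):
    """Smallest slack satisfying score <= -(margin - slack) for a negative example."""
    return max(0.0, score + negative_margin(occluded_flags, m))

# A half-occluded false positive among 68 landmarks, scored at -0.8.
flags = [True] * 34 + [False] * 34
slack_scaled = negative_slack(-0.8, flags, m=0.5)    # margin is 0.75
slack_standard = negative_slack(-0.8, flags, m=0.0)  # margin is 1.0
```

With the scaled margin this half-occluded false positive incurs no penalty, while with $m = 0$ the same score violates the constraint, which is the pressure that drives occlusion biases down in the standard setting.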

We use a standard hard-negative mining or cutting-plane approach to find a small set of active
constraints for each negative image. Given a current estimate of the model parameters $w$, we find the
model configuration $(s, l, o)$ that maximizes $w \cdot \Psi(l, s, o \mid I^t) - m\,\delta(o)$ on a
negative window $I^t$. Since the loss $m\,\delta(o)$ can be decomposed over individual landmarks, this
loss-augmented inference can be easily performed using the same inference procedure introduced in
Section 3.2. We simply subtract $m / N_l$ from the messages sent by occluded landmarks, where $N_l = 68$
is the number of landmarks. During training we make multiple passes through the negative training set and
maintain a pool of hard negatives for each image. We share the slack variable $\eta_t$ for all such
negatives found over a single window $I^t$.
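
The decomposition that makes loss-augmented inference cheap can be written out explicitly. The following is a short worked restatement under the definition of $\delta(o)$ as the fraction of occluded landmarks; the indicator notation is introduced here for illustration.

```latex
m\,\delta(o)
  \;=\; \frac{m}{N_l} \sum_{k=1}^{N_l} \mathbb{1}[\,k \text{ occluded in } o\,]
  \;=\; \sum_{k \,:\, k \text{ occluded in } o} \frac{m}{N_l}
```

Each occluded landmark therefore contributes a constant $m / N_l$ to the loss, independent of every other variable, so the term can be folded into that landmark's message. As a purely illustrative number, with $N_l = 68$ and an assumed $m = 0.5$ the per-landmark adjustment would be $0.5 / 68 \approx 0.0074$.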

4.3. Test-time Prediction 

Scale and In-plane Rotation. We use a standard sliding window approach to search over a range of
locations and scales in each test image. In our experiments, we observed that part models with standard
quadratic spring costs are surprisingly sensitive to in-plane rotation. Models that performed well on
images with controlled acquisition (such as MultiPIE) fared poorly “in the wild” when faces were tilted.
The alignment procedure described above explicitly removes scale and in-plane rotations from the set of
training examples. At test time, we perform an explicit search over in-plane rotations (-30 to 30 degrees
with an increment of 6 degrees).
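
The rotation search amounts to running the detector once per angle and keeping the best response. In the sketch below, `detect` is a hypothetical callable that returns a score for an image rotated by a given angle, and treating both endpoints of the range as inclusive is an assumption.

```python
def best_over_rotations(detect, image, angles=range(-30, 31, 6)):
    """Evaluate a detector at each in-plane rotation and return (best_score, angle)."""
    return max(((detect(image, angle), angle) for angle in angles),
               key=lambda result: result[0])

# Stand-in detector whose score peaks when the face is rotated by 12 degrees.
def toy_detect(image, angle):
    return -abs(angle - 12)

n_angles = len(list(range(-30, 31, 6)))
best_score, best_angle = best_over_rotations(toy_detect, image=None)
```

Under the inclusive-endpoint assumption the grid contains 11 angles, so the rotation search multiplies the cost of the sliding-window pass by the same factor.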

Landmark Prediction. The number of landmarks in our model was chosen based on the availability of
68-landmark ground-truth annotations. In cases where it was useful to benchmark landmark localization of
our model on datasets using different landmark annotation standards (e.g., COFW 29-landmark data), we
used additional held-out training data to fit a simple linear map from the part locations returned by our
hierarchical part model to the desired output space. This provided a more stable procedure than simpler
heuristics such as hand selecting a subset of landmarks.

Let $L^i \in \mathbb{R}^{2N_l}$ be the vector of landmark locations
returned at the top scoring detection when running the model on a
training example $i$. Let $T^i \in \mathbb{R}^{2M}$ be a vector of
ground-truth landmark locations for that image based on some other
annotation standard (i.e., $M \neq N_l$). We train a linear regressor

$$\min_{\beta} \sum_i \| T^i - \beta^T L^i \|^2 + \lambda \|\beta\|^2$$

where $\beta \in \mathbb{R}^{2N_l \times 2M}$ is a matrix of learned
coefficients and $\lambda$ is a regularization parameter. To prevent
overfitting, we restrict $\beta_{pq}$ to be zero unless the landmark $p$
belongs to the same part as $q$.

Figure 5: Examples of landmark localization and occlusion estimation for images from the HELEN (row 1) and COFW (rows 2-3) test datasets. Red indicates those
landmarks which are predicted as being occluded by the HPM.
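As an illustration of the restricted regression above, the following sketch solves the ridge problem with a sparsity mask. Because the squared-error objective separates over output coordinates, each column of the coefficient matrix can be solved independently using only its allowed inputs. The function name, the boolean `mask` encoding of the same-part restriction, and the synthetic data are assumptions made for this example, not the authors' code.

```python
import numpy as np


def fit_masked_ridge(L, T, mask, lam=1e-3):
    """Ridge regression with a fixed sparsity pattern.

    L:    (n, d_in)  predicted model landmark coordinates, one row per example
    T:    (n, d_out) target landmark coordinates in the other annotation standard
    mask: (d_in, d_out) boolean; mask[p, q] is True where beta[p, q] may be non-zero
    """
    d_in, d_out = mask.shape
    beta = np.zeros((d_in, d_out))
    for q in range(d_out):
        idx = np.flatnonzero(mask[:, q])
        if idx.size == 0:
            continue  # no allowed inputs: the column stays zero
        A = L[:, idx]
        gram = A.T @ A + lam * np.eye(idx.size)
        beta[idx, q] = np.linalg.solve(gram, A.T @ T[:, q])
    return beta


# Synthetic check: two "parts", each mapping only to its own outputs.
rng = np.random.default_rng(0)
n, d_in, d_out = 200, 6, 4
mask = np.zeros((d_in, d_out), dtype=bool)
mask[:3, :2] = True
mask[3:, 2:] = True
beta_true = rng.normal(size=(d_in, d_out)) * mask
L = rng.normal(size=(n, d_in))
T = L @ beta_true
beta = fit_masked_ridge(L, T, mask, lam=1e-8)
```

Here rows of `L` and `T` are examples, so the prediction `L @ beta` corresponds to $\beta^T L^i$ for each example.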

To predict landmark occlusion, we carried out a similar mapping
procedure using regularized logistic regression. However,
in this case we found that simply specifying a fixed correspondence
between the two sets of landmarks based on their average
locations and transferring the occlusion flag from the model to
the benchmark landmark space achieved the same accuracy.
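One simple way to realize such a fixed correspondence is sketched below. The text does not say how the correspondence was chosen; nearest-neighbour matching of average landmark locations is an assumption made purely for illustration, as are the function names.

```python
import numpy as np


def nearest_model_landmark(model_mean, bench_mean):
    """For each benchmark landmark, index of the closest model landmark.

    model_mean: (N, 2) average model landmark locations
    bench_mean: (M, 2) average benchmark landmark locations (same frame)
    """
    d = np.linalg.norm(bench_mean[:, None, :] - model_mean[None, :, :], axis=2)
    return d.argmin(axis=1)


def transfer_occlusion(model_occ, correspondence):
    """Copy each model occlusion flag to its corresponding benchmark landmark."""
    return np.asarray(model_occ, dtype=bool)[correspondence]


model_mean = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 8.0]])
bench_mean = np.array([[9.0, 1.0], [1.0, 1.0]])
corr = nearest_model_landmark(model_mean, bench_mean)
bench_occ = transfer_occlusion([True, False, False], corr)
```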

5. Experimental Evaluation 

Figure shows example outputs of the HPM model run on 
example face images. The model produces both a detection 
score and estimates of landmark locations and occlusion states. 
While the possible occlusion patterns are quite limited (4 occlusion
patterns per part shape), the final predicted occlusions
(marked in red) are quite satisfying in highlighting the support
of the occluder for many instances. We evaluate the performance
of the model on three different tasks: landmark localization,
landmark occlusion prediction, and face detection. In
our experiments we focus on test datasets that have significant
amounts of occlusion and emphasize the ability of the model to
generalize well across datasets.


5.1. Landmark Localization

Datasets. We evaluate performance of our method and related
baselines on three benchmark datasets for landmark localization:
the challenging portion of the IBUG dataset, which contains
a range of poses and expressions [36], a subset of the HELEN
dataset [37] containing occlusions, and the Caltech Occluded
Faces in the Wild (COFW) [15] dataset. We evaluate
on IBUG to provide a baseline for localization in the absence
of occlusion. The latter two datasets were selected to evaluate
the ability of our model in the presence of substantial natural
occlusion, which is not well represented in many benchmarks.
The authors of [15] estimate that COFW contains 23%
occluded landmarks. Fig. 5 depicts selected results of running
our detector on example images from the HELEN and COFW
test datasets.

68 Landmark annotations for COFW. We note there is a variety
of annotation conventions across different face landmark
datasets. COFW is annotated with 29 landmarks while HELEN
includes a much denser set of 194 landmarks. The 300
Faces in-the-wild Challenge (300-W) has produced several
unified benchmarks in which the HELEN dataset has been
re-annotated with a set of 68 standard landmarks. To allow
for a greater range of comparisons and further this standardization,
we manually re-annotated the test images from the COFW
dataset with 68 landmarks and occlusion flags. We also generated
face bounding boxes (using a detection method similar to
that used for the 300-W datasets [38]) for evaluating pose
regression methods that require initialization. We bootstrapped
our annotations from the 29-landmark annotations using a custom
annotation tool. The annotations and benchmarking code
are publicly available.

Figure 6: Panels show cumulative error distribution curves (the proportion of test images that have average landmark localization error below a given threshold)
on three test sets: (a) an occlusion-rich subset of HELEN (Occluded HELEN68), (b) COFW29 and (c) COFW68. The horizontal axis of each panel is the average
localization error as a fraction of interpupillary distance. The legend indicates the training set (in parentheses), the success rate % at a
localization threshold of 0.1 and the average error [in brackets]. The HPM shows good localization performance on these difficult datasets with significant occlusion.
In general, regression models (dashed lines) have better performance for a low localization threshold compared to part-based models (solid lines). However, the
success rates for regression models increase more slowly and are eventually overtaken by those for part models as the allowable localization error threshold
increases.

Localization Evaluation Metrics. To evaluate landmark localization
independent of detection accuracy, we follow a standard
approach that assumes that detection has already been performed
and evaluates performance on cropped versions of test
images. While our model is capable of both detecting and localizing
landmarks, this protocol is necessary to evaluate pose
regression methods that require good initialization. We thus
follow the standard protocol (see e.g., [36]) of using the bounding
boxes provided for each dataset (usually generated from the
output of a face detector): we evaluate the localization accuracy
of the highest scoring detection that overlaps the given
bounding box by at least 70%.
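The selection rule above can be sketched as follows. The text does not define how overlap is measured; intersection-over-union is assumed here for illustration, and the function names and the `(score, box)` detection format are hypothetical.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def select_detection(detections, ref_box, min_overlap=0.7):
    """Highest-scoring (score, box) detection overlapping ref_box enough, else None."""
    ok = [d for d in detections if iou(d[1], ref_box) >= min_overlap]
    return max(ok, key=lambda d: d[0]) if ok else None
```

For example, given a reference box `(0, 0, 10, 10)`, a high-scoring detection elsewhere in the image is ignored in favour of a lower-scoring one that covers the provided box.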

We report the average landmark localization error across
each test set as well as the “success rate”, the proportion of test
images with average landmark localization error below a given
threshold. Distances used in both quantities are expressed as a
proportion of the interpupillary distance (distance between centers
of eyes) specified by the ground-truth. Computing the success
rate across a range of thresholds yields a cumulative error
distribution (CED) curve (Fig. 6). When a single summary number
is desired, we report the success rate at a standard threshold
of 0.1 interpupillary distance (IPD).
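A minimal sketch of these metrics is given below. Array shapes and function names are assumptions made for the example; the eye centers are taken from the ground-truth annotation, as described above.

```python
import numpy as np


def normalized_errors(pred, gt, left_eye, right_eye):
    """Per-image average landmark error as a fraction of IPD.

    pred, gt:            (n_images, n_landmarks, 2) landmark coordinates
    left_eye, right_eye: (n_images, 2) ground-truth eye centers
    """
    ipd = np.linalg.norm(left_eye - right_eye, axis=1)
    per_landmark = np.linalg.norm(pred - gt, axis=2)
    return per_landmark.mean(axis=1) / ipd


def success_rate(errors, threshold=0.1):
    """Proportion of images whose average error is below the threshold."""
    return float(np.mean(errors < threshold))


def ced_curve(errors, thresholds):
    """Success rate at each threshold (cumulative error distribution)."""
    return np.array([success_rate(errors, t) for t in thresholds])


# Two toy images with IPD 10: every landmark is off by 0.5 and 2.0 pixels.
gt = np.zeros((2, 3, 2))
pred = gt.copy()
pred[0, :, 0] = 0.5
pred[1, :, 0] = 2.0
left_eye = np.zeros((2, 2))
right_eye = np.tile([10.0, 0.0], (2, 1))
errors = normalized_errors(pred, gt, left_eye, right_eye)
```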

Training and baselines. To train our model, we used training
data from LFPW (811 images) and/or HELEN (2000 images)
annotated with 68 landmarks. The training set is specified in
parentheses in figure legends. From each training image we
generate 8 synthetically occluded “virtual positives”. To fit linear
regression coefficients for mapping from the HPM predicted
landmark locations to 29-landmark datasets, we ran the trained


https://github.com/golnazghiasi/cofw68-benchmark


Method        Average error
DRMF [38]     0.1979
CDM           0.1954
RCPR [15]     0.1726
ESR           0.1700
CFAN          0.1678
SDM           0.1540
CFSS          0.1200‡/0.0998
TCDCN [22]    0.1121†/0.0860
LBF           0.1198
HPM           0.1310

Table 1: Average errors as a fraction of IPD on the IBUG68 [36] dataset. Results
marked with †/‡ are obtained by testing the method with the standard detector bounding
boxes provided by 300-W, using either the published model (†) or retraining (‡).

model on the COFW training data set and fit regression parameters
$\beta$ that mapped from the 68 predicted points to the 29
annotated landmarks.

For diagnostic purposes, we trained several baseline models
including a version of our model without occlusion mixtures
(HPM-occ) and the (non-hierarchical) deformable part model
(DPM) described by Zhu et al. We also evaluated variants of the
robust cascaded pose regression (RCPR) described in [15] as
well as their implementation of explicit shape regression (ESR),
using both pre-trained models provided by the authors and
models retrained to predict 68 landmarks. Unlike HPM, which
uses virtual occlusion, RCPR requires training examples with
actual occlusions and corresponding annotations. For training
sets that featured no occlusion, we thus trained a variant that
does not model occlusion (RCPR-occ).

Localization Results (Occluded HELEN 68). We evaluated on
a subset of the HELEN dataset [37] consisting of 126 images
which were selected on the basis of having some significant amount
of occlusion. We do not report results of the HPM (HELEN68)
model on this dataset as there was overlap between
training and testing images. Fig. 6(a) shows the error distribution.
The HPM achieves an average error of 0.0811, beating out
the DPM baseline (0.0931) and RCPR-occ (0.0903). Removing
explicit occlusion from the model (HPM-occ) results in lower
success rates for a range of thresholds.

The originally published DPM model of Zhu et al. was trained on the very constrained
MultiPIE dataset. Retraining the model of Zhu et al. and including
in-plane rotation search at test time yielded significantly better performance
than reported elsewhere (cf. [15]).

Figure 7: We analyze the average landmark localization error of RCPR, HPM
and DPM for different overlap ratios with the ground-truth face boxes. For
RCPR we change the minimum overlap ratio between the initial bounding boxes and
the ground-truth face boxes. For HPM and DPM, we change the minimum overlap
threshold between the returned detections and ground-truth boxes. RCPR is very
sensitive to the amount of overlap and its performance decreases rapidly as the
overlap ratio decreases. In contrast, HPM and DPM are robust to the overlap threshold
and maintain the same performance over different thresholds.

Localization Results (COFW29). To facilitate diagnostic comparison
to previously published results, we evaluated our model
on the original COFW 29-landmark test set [15], which consists of
507 internet photos depicting a wide variety of more difficult
poses and includes a significant amount of occlusion. Since
COFW training data only contains 29 landmarks (we only performed
additional annotations on test data), we evaluated models trained
on LFPW68 and HELEN68. Fig. 6(b) shows that HPM achieves
a significantly lower average error than RCPR and higher success
rates for all but the smallest (< 0.06) localization success
thresholds.

Localization Results (COFW68). We tested our model trained
on LFPW68 and HELEN68 training data on this benchmark
and compared with CFSS, TCDCN and RCPR-occ (Fig. 6(c)).
For CFSS and TCDCN we used the publicly available
pre-trained models, which were trained on HELEN68, LFPW68
and AFW68 (TCDCN is also pretrained on the MAFL dataset). For
RCPR-occ we used the authors’ code to train a model on the
HELEN68 and LFPW68 training sets. Note that we could not train


https://github.com/golnazghiasi/Occluded-HELEN-image-list



                                  LFPW (29)         COFW (29)
Model      Training dataset      SR       AE       SR       AE
RCPR-occ   LFPW29                88.95    0.073    63.44    0.115
RCPR-occ   LFPW29+               98.95    0.038    63.64    0.096
RCPR-occ   COFW29                89.01    0.071    76.28    0.091
RCPR       COFW29                91.05    0.064    79.25    0.085
HPM        LFPW68, INR-          97.37    0.050    86.76    0.075
HPM        HELEN68, INR-         98.42    0.049    90.71    0.072
HPM        HELEN68, PAS-         98.95    0.048    92.09    0.070

Table 2: We find HPM generalizes well across datasets while pose regression
has a strong dependence on training data. Localization performance is measured
by success rate (SR, in %) and average error (AE). The RCPR model trained
on COFW performs much better on COFW test data compared to RCPR-occ
trained on LFPW29+ (79% SR vs 64% SR) but has much worse performance
on LFPW test data compared to that model (91% SR vs 99% SR). Good performance
on LFPW also depends heavily on including additional warped positive
instances (LFPW29+ vs LFPW29). The HPM trained on LFPW68 has high
success rates on both COFW (87%) and LFPW (97%) test data. The last two rows
of the table show the performance of HPM when a different training data set
(HELEN68) is used for training. This dataset has more variation and more images
(1758) compared to LFPW68 (682) and improves performance of HPM on
both test datasets. Training on more negative images (6000 images from PASCAL,
denoted PAS-) decreases the localization error of our model compared to using
only INRIA negatives (denoted INR-).

the full RCPR 68-landmark model with occlusion since HELEN68
and LFPW68 do not have occlusion annotations and the COFW
training set is only labeled with 29 landmarks.

Localization Results (IBUG 68). This dataset contains 68-landmark
annotations for 135 faces in difficult poses and expressions [36].
For testing our method on this dataset, we follow
previous work and train our model on the combined HELEN68
and LFPW68 training data provided by 300-W. Since IBUG includes
many side-view faces, we trained a variant of our model
with 7 viewpoints. We compare our model with the published
performance of several state-of-the-art methods in Table 1 and achieve
comparable performance.

In addition to reporting values from the published literature,
we also re-evaluated two recent top-performing models:
TCDCN [22] and CFSS. Since these methods operate in
the general framework of pose regression, performing iterative
refinement of predicted landmark locations, they are sensitive
to initial bounding box location. We tested both models using
the standardized detection bounding boxes provided by the
300-W benchmark rather than tightly cropping images to
the ground-truth landmark locations. We used the pre-trained
TCDCN model available online, while for CFSS we retrained
the model using the standard detector bounding boxes. In both
cases, average error was significantly worse than previously reported
results, highlighting the sensitivity of these methods to
initialization.

Dependence of Localization on Detection. A key benefit of the
HPM (and DPM) approach is that the same model serves to
both detect and localize the landmarks. In contrast, pose regression
methods such as RCPR, TCDCN or CFSS require that the
face already be detected. This distinction becomes particularly
important for occluded faces since detection is significantly less
accurate (see Detection experiments below).

Figure 8: Occlusion prediction accuracy on the COFW test dataset for variants of our model. Panels: (a) occlusion prediction accuracy (axis label: Recall),
(b) success rate vs. occlusion recall, (c) localization error vs. occlusion recall. Using a suitable margin scaling function (see Sec. 4.2) allows for
significantly better occlusion prediction accuracy than an independent occlusion model (a) with minimal loss in localization performance (b,c). Localization
performance of DPM and RCPR is included for reference.

To highlight the dependence of landmark localization on
accurate detection, we benchmarked average localization error
for varying degrees of overlap between the hypothesized detection
and ground-truth bounding box on the COFW test set. As
shown in Fig. 7, decreasing the overlap ratio has no effect on HPM
/ DPM performance since there are never false positives in the
vicinity of the face that score higher than one with a high overlap
ratio. In contrast, RCPR performs significantly worse when initialized
from bounding boxes that do not have high overlap with
the face. Since the area over which RCPR searches is learned
from training data, we also retrained a version of RCPR for
each degree of overlap. This yielded improved performance
but still shows a significant fall-off in performance compared
to the HPM. As noted above, we encountered similar behavior
when evaluating other methods such as TCDCN and CFSS on
realistic detector-generated bounding boxes.

Dependence of Localization on Training data. One advantage
of the HPM model is robustness to the choice of training data
set. Table 2 highlights a comparison of HPM and RCPR in
which the training set is varied. HPM performs well on LFPW
and COFW regardless of training set specifics. In contrast,
RCPR shows better performance on COFW when the training
data is also taken from COFW. Training data augmentation is
also important to achieve good performance with RCPR, while
HPM works well even when trained on the relatively smaller
LFPW training set.

5.2. Occlusion Prediction 

To evaluate the ability of the model to correctly determine
which landmarks are occluded, we evaluate the accuracy of occlusion
as a binary prediction task. For a given test set, we
compute precision and recall of occlusion predictions relative
to the ground-truth occlusion labels of the landmarks.
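The binary evaluation above can be sketched as follows, pooling all landmarks of a test set into flat arrays. The function name and the convention of returning 1.0 when a denominator is empty are assumptions of this example.

```python
import numpy as np


def occlusion_precision_recall(pred_occ, gt_occ):
    """Precision and recall of per-landmark occlusion predictions.

    pred_occ, gt_occ: flat boolean arrays, True meaning "occluded".
    """
    pred = np.asarray(pred_occ, dtype=bool)
    gt = np.asarray(gt_occ, dtype=bool)
    tp = int(np.sum(pred & gt))    # predicted occluded, truly occluded
    fp = int(np.sum(pred & ~gt))   # predicted occluded, actually visible
    fn = int(np.sum(~pred & gt))   # missed occlusions
    # Assumed convention: an empty denominator yields 1.0.
    precision = tp / (tp + fp) if tp + fp > 0 else 1.0
    recall = tp / (tp + fn) if tp + fn > 0 else 1.0
    return precision, recall


p, r = occlusion_precision_recall([1, 1, 0, 0, 1], [1, 0, 1, 0, 1])
```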


For HPM, we trace out a precision-recall curve for occlusion
prediction by adjusting the model parameters to induce different
predicted occlusions. As described earlier, the bias
parameter $b_{ij}(s_i, s_j, o_i, o_j)$ favors particular co-occurrences of
part types. By increasing (decreasing) the bias for occluded
configurations we can encourage (discourage) the model to use
those configurations at test time. Let $b_{ij}(s_i, s_j, o_i, o_j)$ be a learned
bias parameter between an occluded leaf and its parent. To
make the model favor occluded parts, we modify this parameter
to $b_{ij}(s_i, s_j, o_i, o_j) + |b_{ij}(s_i, s_j, o_i, o_j)| \times \alpha$,
where $\alpha$ is a scalar offset.
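The bias modification above amounts to a one-line update; the sketch below uses a hypothetical function name and treats each learned bias as a plain float.

```python
def offset_occlusion_bias(b, alpha):
    """Shift a learned bias b by a fraction alpha of its magnitude."""
    return b + abs(b) * alpha
```

Scaling by the magnitude of the bias means that a positive offset always raises the bias, whether the learned value is negative or positive, while an offset of zero leaves the trained model unchanged.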

Fig. 8(a) depicts occlusion precision-recall curves generated
by running the HPM model for different values of the bias offset
$\alpha$. The crosses mark the precision-recall for the default operating point
when $\alpha = 0$. We compare performance of the HPM model
with different values of the margin scaling hyper-parameter m
as well as RCPR and a baseline independent occlusion model.
Fig. 8(b) and (c) show the corresponding average errors and
success rates for these models parameterized by the recall of
occlusion. For large values of $\alpha$, the model predicts more occlusions,
resulting in improved recall at the expense of precision
(a) and ultimately lower localization accuracy (b,c).


Margin scaling. As described in Section 4.2, we can change
the learning parameter m to produce models with different recall
of occlusions at the trained operating point ($\alpha = 0$). When
m = 0, all the negative examples, including fully or partially occluded
configurations, are penalized equally. Therefore, the model
learns small biases for occluded configurations, reducing the
total loss over occluded negative examples and decreasing the default
recall of occlusion. When driven to predict more occlusion
by increasing $\alpha$, the model's localization performance degrades
rapidly. Training the model with larger values of m yields a
model which naturally predicts occlusion more frequently and
degrades more gracefully for larger values of $\alpha$. We found that
choosing a value of m = 0.5 provided a good compromise,
improving both recall and localization accuracy.


Figure 9: Face detection performance of HPM and state-of-the-art methods 
(40) on the continuous-ROC FDDB benchmark (5). 


Independent occlusion baseline. We compared the results of
HPM with a model that had the same architecture but in which
there are no occlusion mixtures at the part level and each landmark
is allowed to be independently set to visible or occluded
depending on learned biases. We refer to this as “independent
occlusion” since the model does not capture any correlations
between the occlusion of different landmarks. We found that
this independent occlusion model has many of the same benefits
as the HPM model in terms of landmark localization accuracy
(Fig. 8). However, occlusion prediction accuracy is significantly
worse in the independent model, with precision typically
5% lower than HPM (m = 0.9) over a range of recall values.

5.3. Detection 

Pose regression requires good initialization provided by a
face detector to accurately locate landmarks. In contrast, part-based
models have the elegant advantage of performing detection
and localization simultaneously. In this section, we compare
the detection performance of our approach and other top
methods on two datasets: FDDB and our own Occluded
Face Detection (UCI-OFD) dataset.


Multi-resolution HPM. Since many face detection datasets such
as FDDB contain many low-resolution faces, we trained a multi-resolution
variant of our model. This model has a high-resolution and
a low-resolution model for each viewpoint. The high-resolution
model has the same structure as our trained model for landmark
localization except that parts are represented as 3x3 HoG cells
rather than 5x5. The low-resolution model has 7 parts (right
eye, left eye, nose, mouth, chin, left jaw and right jaw), each of
which is represented by 7x7 HoG cells with a spatial bin size
of 4. Each part has one shape mixture and 2 occlusion mixtures
(visible or occluded). The heights (eyebrow to chin) of
the large model and small model are about 100 and 60 pixels
respectively. To detect even smaller faces, we upsample input
images by a factor of 2 to allow for detection of faces as small
as 30 pixels. We trained this model using the same 1758 positive
examples from HELEN68 and generated 8 virtual positive
examples per example. For negative images we used 6000 images
from the PASCAL VOC 2010 train-val set which do not
contain people.

Detection on FDDB. We evaluated our multi-resolution model
on the widely used FDDB dataset. This dataset contains 5171
faces in a set of 2845 images. Faces are annotated by ellipses
in this dataset and are as small as 20 pixels in height. To match
that, we map our predicted landmark locations to ellipses using
a linear regression model. FDDB has 10 folds and the ROC
curves are the average over the results of these folds. To compute
ellipses for each fold, we learned the linear regression coefficients
using examples from the other 9 folds.

We used the standard evaluation protocol for this dataset
and compared our method with the top published results available
on the FDDB website. The continuous ROC curves for
our method and leading methods are shown in Fig. 9, plotted on
a semi-log scale. Our result is highly competitive with the top
results. The model has better performance on the continuous
ROC evaluation relative to other methods because it can predict
the location of parts and compute accurate bounding ellipses
around the faces.

UCI Occluded Face Detection Dataset (UCI-OFD). In order
to better measure the ability of our model to handle detection
of occluded faces, we assembled a preliminary dataset for occluded
face detection. This dataset and benchmarking code are
publicly available. It consists of 61 images from Flickr containing
766 labeled faces. Of the faces in these images, 430
include some amount of occlusion. Most of the faces are near
frontal and vertical. The height (eyebrow to chin) of the smallest
face is about 40 pixels.

Precision/Recall curves of face detection for the multi-resolution
HPM, HPM, HPM-occ, DPM and Cascade DPM [41] are shown
in Fig. 10(a). We further break down performance, plotting
Precision/Recall curves for the subset of faces with some amount
of occlusion in (b) and fully visible in (c). Precision and recall
for the occluded subset of faces are calculated as follows:

$$\mathrm{Precision}_o = \frac{tp_o}{tp_o + fp}, \qquad \mathrm{Recall}_o = \frac{tp_o}{tp_o + fn_o}$$

where $tp_o$ and $fn_o$ denote the number of correct detections and missed
detections of occluded faces, respectively. Our method significantly
outperforms other methods on the occluded subset, while
the performance of all of the methods is almost equal on the
visible subset. Fig. 11 shows example detection results produced
by the model on cluttered scenes containing many overlapping
faces.
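The occluded-subset precision and recall defined above can be computed from raw counts as in the sketch below. The function name is hypothetical, and the zero-denominator convention (returning 0.0) is an assumption of this example; `fp` is passed in as given, since the text does not spell out how false positives are counted for the subset.

```python
def occluded_subset_pr(tp_o, fp, fn_o):
    """Precision and recall restricted to occluded faces.

    tp_o: correct detections of occluded faces
    fp:   false positive detections
    fn_o: occluded faces that were missed
    """
    # Assumed convention: an empty denominator yields 0.0.
    precision = tp_o / (tp_o + fp) if tp_o + fp > 0 else 0.0
    recall = tp_o / (tp_o + fn_o) if tp_o + fn_o > 0 else 0.0
    return precision, recall


prec_o, rec_o = occluded_subset_pr(tp_o=30, fp=10, fn_o=20)
```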


https://github.com/golnazghiasi/hpm-detection-code/tree/master/UCI_OFD

Figure 10: Precision-Recall curves of face detection on our UCI-OFD dataset (a) for all of the faces, (b) occluded subset and (c) visible subset. On the visible
subset our model, DPM retrained on HELEN68 and Cascade DPM [41] have nearly identical performance, but our model significantly outperforms these methods
on the occluded subset and has better overall performance. Cascade DPM uses many acceleration techniques, which may reject some of the faces. Its maximum
recall for the visible faces is near 100%, while its maximum recall for the occluded faces is only 60%. The initial drop in the Precision-Recall curve of this method for the
occluded subset arises because its returned bounding boxes for some of the high-scoring occluded faces are not accurate and do not have the minimum 0.5 overlap with
the ground-truth bounding boxes.


6. Discussion and Conclusion 

Our experimental results demonstrate that adding coherent
occlusion and hierarchical structure allows for substantial gains
in performance for landmark localization and detection in part
models. In images with relatively little occlusion, the HPM
gives similar detection and localization performance to other
part-based approaches, e.g. DPM, but is significantly more
robust to occlusion. Our results also suggest that when it is
useful to determine exactly which parts are occluded (e.g., for
later use in face identification), independent occlusion makes
weaker predictions than our part occlusion mixtures, which enforce
coherence between neighboring landmarks. While not
specifically trained for landmark estimation, the final HPM is
competitive with pose regression techniques in terms of landmark
localization accuracy on unoccluded faces (IBUG) and
outperforms many such methods on occluded faces (Occluded
HELEN, COFW).

In comparing pose regression and part-based models, there
seem to be several interesting trade-offs. In our experiments,
we see a general trend in which error distribution curves for
pose regression and part-based models cross, suggesting that
pose regression yields very accurate localization for a subset of
images relative to the HPM but fails for some other proportion
even at very large error thresholds. Unlike pose regression, the
part model performs detection, eliminating the need for detection
as a pre-process and improving robustness. In particular,
we are able to detect many heavily occluded faces which would
not be detected by a standard cascade detector and hence are
inaccessible to pose regression. We find that the HPM tends to
generalize well across datasets, suggesting it avoids some
overfitting problems present in pose regression.

This flexibility currently comes with a computational cost.
The run-time of our model implementation built on dynamic
programming lags significantly behind that of regression-based,
feed-forward approaches. Our implementation takes ~10s to
run on a typical COFW image, roughly 100x slower than RCPR
or DCNN based approaches. However, the HPM is amenable
to implementation on a GPU, which may address most of this
runtime gap.

Finally, we note several avenues for future work. Performance
depends on the graphical independence structure of the
model, which should ideally be learned from data. While our
model implicitly represents the pattern of part occlusions, it
does not integrate local image evidence for the occluder itself.
A natural extension would be to add local filters that detect the
presence of an occluding contour between the occluded and
non-occluded landmarks. Such filters could be shared across
parts to limit the increase in overall computation cost
while moving closer to our goal of explaining away missing
object parts using positive evidence of coherent occlusion.

Figure 11: Examples of detection and localization for images from our UCI-OFD dataset (rows 1-2) and images containing occlusion from FDDB dataset (rows
3-4). Detections indicated with only 7 landmarks correspond to responses from the low-resolution model component. Ellipses are predicted on FDDB images by
linear regression from landmark locations to ellipse parameters.